

#### Many-core Design for Data-Flow Execution

Using the SWitches Prototype Implementation

#### **Andreas Diavastos**

diavastos@cs.ucy.ac.cy

Advisor: Pedro Trancoso



Department of Computer Science, University of Cyprus CASPER Group: Computer Architecture, Systems and Performance Evaluation Research Group

#### Introduction



- Multi-cores:
  - -2 16 cores
  - Designed for small-scale parallelism & for sequential performance

#### • Many-cores:

- -10s 100s 1000s cores
- Design for large-scale parallelism ONLY!
- Sequential programs will run much slower!
  - If they can execute



#### Introduction



#### Control-flow Data-flow





#### Introduction







# What is parallelism?



#### Serial Processor

#### Parallel Processor



- Many processors executing instructions from the same program in parallel
- Instruction execution time is the same
- **Program execution time is less!!**





#### Motivation

- Many-core processing for increased parallelism
   HPC systems already include many-cores
- More parallelism  $\rightarrow$  More performance
- Many-cores today (and tomorrow?)
  - Cache-coherent Shared Memory
    - Can they scale?
  - Distributed Memory Clustered
    - Are they efficient for fine-grain parallelism?
- Software Parallel Programming Systems
  - Shared-memory
    - Good on multi-cores, not evaluated on many-cores
    - Need cache-coherence!
  - Distributed-memory
    - Good for large-scale distributed systems
    - Programming is becoming an impossible task











#### We need:

- Scalable Hardware:
  - A non-coherent shared-memory many-core
- Scalable Software:
  - Easy programming
  - Non-blocking execution
  - Doesn't need cache-coherence support
  - Exploits large amounts of fine-grain parallelism

# Scale performance on non-coherent shared-memory many-core processors

Many-core Design for Data-Flow Execution: Using the SWitches Prototype Implementation



**。** O



#### **Expected Contributions:**





- Non-coherent shared-memory many-core design
  - Simulation-based
  - With 100s of cores
- A software parallel programming system
  - Based on data-flow execution
  - Scale performance to 100s of cores
  - Validate to real hardware
  - Support for conventional programming APIs
    - OpenMP



## Outline



- Motivation
- Contributions
- Related Work
  - Data-flow-based Programming Systems
  - DDM Programming Systems
- SWitches Programming System
- Preliminary Work
- Roadmap Timeline



## Outline



- Motivation
- Contributions
- Related Work
  - Data-flow-based Programming Systems
  - DDM Programming Systems
- SWitches Programming System
- Preliminary Work
- Roadmap Timeline





University of Cyprus Department of Computer



University of Cyprus Department of Computer



University of Cyprus Department of Computer



University of Cyprus Department of Computer



| Characteristics of Dataflow Implementations | OmpSs                                                                                                                                              | SWARM                                                                                                                                                                                                                | Intel TBB                                                                                                                                                     |  |
|---------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|--|
| Implementation (Software / Hardware)        | Software                                                                                                                                           | Software                                                                                                                                                                                                             | Software                                                                                                                                                      |  |
| Scheduling Policy (Static / Dynamic)        | Dynamic                                                                                                                                            | Dynamic                                                                                                                                                                                                              | Dynamic                                                                                                                                                       |  |
| Memory Model (Shared / Private)             | Shared / GPU                                                                                                                                       | Shared / Distributed                                                                                                                                                                                                 | Shared                                                                                                                                                        |  |
| Needs cache-coherency                       | Yes                                                                                                                                                | Yes                                                                                                                                                                                                                  | Yes                                                                                                                                                           |  |
| Number of cores/threads tested              | 24                                                                                                                                                 | 24                                                                                                                                                                                                                   | 8                                                                                                                                                             |  |
| Max Speedup achieved                        | depends on application                                                                                                                             | 8                                                                                                                                                                                                                    | 8                                                                                                                                                             |  |
| How dependencies are expressed              | directives in(), out(), inout()                                                                                                                    | C-macros API to represent<br>Codelets                                                                                                                                                                                | macro-API / explicit task<br>dependencies                                                                                                                     |  |
| Programming Language                        | C / C++                                                                                                                                            | С                                                                                                                                                                                                                    | C++                                                                                                                                                           |  |
| Contributions                               | Single programming model for<br>homogeneous & heterogeneous<br>architectures                                                                       | <ul> <li>Unified single-, multi-<br/>node interface,<br/>transparent to<br/>the programmer</li> </ul>                                                                                                                | <ul> <li>Parallel algorithms and<br/>data structures</li> <li>Scalable memory<br/>allocation and task<br/>scheduling</li> </ul>                               |  |
| Date                                        | 2011                                                                                                                                               | 2013                                                                                                                                                                                                                 | 2007                                                                                                                                                          |  |
| Notes                                       | <ul> <li>Based on StarSs and OpenMP</li> <li>Builds graph at runtime</li> <li>Task-dependency graph</li> <li>Each task is executed once</li> </ul> | <ul> <li>Difficult programming</li> <li>Work stealing across<br/>nodes and threads</li> <li>One scheduler on each<br/>thread/node</li> <li>Overhead of runtime<br/>observed for fine-grain<br/>scheduling</li> </ul> | <ul> <li>Rich feature set for<br/>general purpose<br/>parallelism</li> <li>Data-dependency graph</li> <li>Each task can execute<br/>multiple times</li> </ul> |  |



| Characteristics of Dataflow Implementations | OmpSs                                                                                                                                              | SWARM                                                                                                                                                                                                                | Intel TBB                                                                                                                                                     |  |
|---------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|--|
| Implementation (Software / Hardware)        | Software                                                                                                                                           | Software                                                                                                                                                                                                             | Software                                                                                                                                                      |  |
| Scheduling Policy (Static / Dynamic)        | Dynamic                                                                                                                                            | Dynamic                                                                                                                                                                                                              | Dynamic                                                                                                                                                       |  |
| Memory Model (Shared / Private)             | Shared / GPU                                                                                                                                       | Shared / Distributed                                                                                                                                                                                                 | Shared                                                                                                                                                        |  |
| Needs cache-coherency                       | Yes                                                                                                                                                | Yes                                                                                                                                                                                                                  | Yes                                                                                                                                                           |  |
| Number of cores/threads tested              | 24                                                                                                                                                 | 24                                                                                                                                                                                                                   | 8                                                                                                                                                             |  |
| Max Speedup achieved                        | depends on application                                                                                                                             | 8                                                                                                                                                                                                                    | 8                                                                                                                                                             |  |
| How dependencies are expressed              | directives in(), out(), inout()                                                                                                                    | C-macros API to represent<br>Codelets                                                                                                                                                                                | t macro-API / explicit task<br>dependencies                                                                                                                   |  |
| Programming Language                        | C / C++                                                                                                                                            | С                                                                                                                                                                                                                    | C++                                                                                                                                                           |  |
| Contributions                               | Single programming model for<br>homogeneous & heterogeneous<br>architectures                                                                       | <ul> <li>Unified single-, multi-<br/>node interface,<br/>transparent to<br/>the programmer</li> </ul>                                                                                                                | <ul> <li>Parallel algorithms and<br/>data structures</li> <li>Scalable memory<br/>allocation and task<br/>scheduling</li> </ul>                               |  |
| Date                                        | 2011                                                                                                                                               | 2013                                                                                                                                                                                                                 | 2007                                                                                                                                                          |  |
| Notes                                       | <ul> <li>Based on StarSs and OpenMP</li> <li>Builds graph at runtime</li> <li>Task-dependency graph</li> <li>Each task is executed once</li> </ul> | <ul> <li>Difficult programming</li> <li>Work stealing across<br/>nodes and threads</li> <li>One scheduler on each<br/>thread/node</li> <li>Overhead of runtime<br/>observed for fine-grain<br/>scheduling</li> </ul> | <ul> <li>Rich feature set for<br/>general purpose<br/>parallelism</li> <li>Data-dependency graph</li> <li>Each task can execute<br/>multiple times</li> </ul> |  |



| Characteristics of Dataflow Implementations | OmpSs                                                                                                                                              | SWARM                                                                                                                                                                                                                | Intel TBB                                                                                                                                                     |  |
|---------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|--|
| Implementation (Software / Hardware)        | Software                                                                                                                                           | Software                                                                                                                                                                                                             | Software                                                                                                                                                      |  |
| Scheduling Policy (Static / Dynamic)        | Dynamic                                                                                                                                            | Dynamic                                                                                                                                                                                                              | Dynamic                                                                                                                                                       |  |
| Memory Model (Shared / Private)             | Shared / GPU Shared / Distributed                                                                                                                  |                                                                                                                                                                                                                      | Shared                                                                                                                                                        |  |
| Needs cache-coherency                       | Yes                                                                                                                                                | Yes                                                                                                                                                                                                                  | Yes                                                                                                                                                           |  |
| Number of cores/threads tested              | 24                                                                                                                                                 | 24                                                                                                                                                                                                                   | 8                                                                                                                                                             |  |
| Max Speedup achieved                        | depends on application                                                                                                                             | 8                                                                                                                                                                                                                    | 8                                                                                                                                                             |  |
| How dependencies are expressed              | directives in(), out(), inout()                                                                                                                    | C-macros API to represent<br>Codelets                                                                                                                                                                                | macro-API / explicit task<br>dependencies                                                                                                                     |  |
| Programming Language                        | C / C++                                                                                                                                            | С                                                                                                                                                                                                                    | C++                                                                                                                                                           |  |
| Contributions                               | Single programming model for<br>homogeneous & heterogeneous<br>architectures                                                                       | <ul> <li>Unified single-, multi-<br/>node interface,<br/>transparent to<br/>the programmer</li> </ul>                                                                                                                | <ul> <li>Parallel algorithms and<br/>data structures</li> <li>Scalable memory<br/>allocation and task<br/>scheduling</li> </ul>                               |  |
| Date                                        | 2011                                                                                                                                               | 2013                                                                                                                                                                                                                 | 2007                                                                                                                                                          |  |
| Notes                                       | <ul> <li>Based on StarSs and OpenMP</li> <li>Builds graph at runtime</li> <li>Task-dependency graph</li> <li>Each task is executed once</li> </ul> | <ul> <li>Difficult programming</li> <li>Work stealing across<br/>nodes and threads</li> <li>One scheduler on each<br/>thread/node</li> <li>Overhead of runtime<br/>observed for fine-grain<br/>scheduling</li> </ul> | <ul> <li>Rich feature set for<br/>general purpose<br/>parallelism</li> <li>Data-dependency graph</li> <li>Each task can execute<br/>multiple times</li> </ul> |  |



| Characteristics of Dataflow Implementations | OmpSs                                                                                                                                              | SWARM                                                                                                                                                                                                                | Intel TBB                                                                                                                                                     |  |
|---------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|--|
| Implementation (Software / Hardware)        | Software                                                                                                                                           | Software                                                                                                                                                                                                             | Software                                                                                                                                                      |  |
| Scheduling Policy (Static / Dynamic)        | Dynamic                                                                                                                                            | Dynamic                                                                                                                                                                                                              | Dynamic                                                                                                                                                       |  |
| Memory Model (Shared / Private)             | Shared / GPU Shared / Distributed                                                                                                                  |                                                                                                                                                                                                                      | Shared                                                                                                                                                        |  |
| Needs cache-coherency                       | Yes                                                                                                                                                | Yes                                                                                                                                                                                                                  | Yes                                                                                                                                                           |  |
| Number of cores/threads tested              | 24                                                                                                                                                 | 24                                                                                                                                                                                                                   | 8                                                                                                                                                             |  |
| Max Speedup achieved                        | depends on application                                                                                                                             | 8                                                                                                                                                                                                                    | 8                                                                                                                                                             |  |
| How dependencies are expressed              | directives in(), out(), inout()                                                                                                                    | in(), out(), inout() C-macros API to represent<br>Codelets                                                                                                                                                           |                                                                                                                                                               |  |
| Programming Language                        | C / C++                                                                                                                                            | С                                                                                                                                                                                                                    | C++                                                                                                                                                           |  |
| Contributions                               | Single programming model for<br>homogeneous & heterogeneous<br>architectures                                                                       | <ul> <li>Unified single-, multi-<br/>node interface,<br/>transparent to<br/>the programmer</li> </ul>                                                                                                                | <ul> <li>Parallel algorithms and<br/>data structures</li> <li>Scalable memory<br/>allocation and task<br/>scheduling</li> </ul>                               |  |
| Date                                        | 2011                                                                                                                                               | 2013                                                                                                                                                                                                                 | 2007                                                                                                                                                          |  |
| Notes                                       | <ul> <li>Based on StarSs and OpenMP</li> <li>Builds graph at runtime</li> <li>Task-dependency graph</li> <li>Each task is executed once</li> </ul> | <ul> <li>Difficult programming</li> <li>Work stealing across<br/>nodes and threads</li> <li>One scheduler on each<br/>thread/node</li> <li>Overhead of runtime<br/>observed for fine-grain<br/>scheduling</li> </ul> | <ul> <li>Rich feature set for<br/>general purpose<br/>parallelism</li> <li>Data-dependency graph</li> <li>Each task can execute<br/>multiple times</li> </ul> |  |



| Characteristics of Dataflow Implementations | OmpSs                                                                                                                                              | OmpSs SWARM                                                                                                                                                                                                          |                                                                                                                                                               |  |
|---------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|--|
| Implementation (Software / Hardware)        | Software                                                                                                                                           | Software                                                                                                                                                                                                             | Software                                                                                                                                                      |  |
| Scheduling Policy (Static / Dynamic)        | Dynamic                                                                                                                                            | Dynamic                                                                                                                                                                                                              | Dynamic                                                                                                                                                       |  |
| Memory Model (Shared / Private)             | Shared / GPU                                                                                                                                       | Shared / Distributed                                                                                                                                                                                                 | Shared                                                                                                                                                        |  |
| Needs cache-coherency                       | Yes                                                                                                                                                | Yes                                                                                                                                                                                                                  | Yes                                                                                                                                                           |  |
| Number of cores/threads tested              | 24 24                                                                                                                                              |                                                                                                                                                                                                                      | 8                                                                                                                                                             |  |
| Max Speedup achieved                        | depends on application                                                                                                                             | 8                                                                                                                                                                                                                    | 8                                                                                                                                                             |  |
| How dependencies are expressed              | directives in(), out(), inout()                                                                                                                    | C-macros API to represent<br>Codelets                                                                                                                                                                                | macro-API / explicit task<br>dependencies                                                                                                                     |  |
| Programming Language                        | C / C++                                                                                                                                            | С                                                                                                                                                                                                                    | C++                                                                                                                                                           |  |
| Contributions                               | Single programming model for<br>homogeneous & heterogeneous<br>architectures                                                                       | <ul> <li>Unified single-, multi-<br/>node interface,<br/>transparent to<br/>the programmer</li> </ul>                                                                                                                | <ul> <li>Parallel algorithms and<br/>data structures</li> <li>Scalable memory<br/>allocation and task<br/>scheduling</li> </ul>                               |  |
| Date                                        | 2011                                                                                                                                               | 2013                                                                                                                                                                                                                 | 2007                                                                                                                                                          |  |
| Notes                                       | <ul> <li>Based on StarSs and OpenMP</li> <li>Builds graph at runtime</li> <li>Task-dependency graph</li> <li>Each task is executed once</li> </ul> | <ul> <li>Difficult programming</li> <li>Work stealing across<br/>nodes and threads</li> <li>One scheduler on each<br/>thread/node</li> <li>Overhead of runtime<br/>observed for fine-grain<br/>scheduling</li> </ul> | <ul> <li>Rich feature set for<br/>general purpose<br/>parallelism</li> <li>Data-dependency graph</li> <li>Each task can execute<br/>multiple times</li> </ul> |  |

#### **Conclusions on Data-flow Systems**

- Dynamic scheduling <sup>(all)</sup>
   Adds runtime overhead
- Centralized Runtime <sup>(OmpSs, TBB)</sup>
   Single-point of access
- Scale up to 24 cores <sup>(all)</sup>
  None is tested to more cores
- Runtime Dependency resolution (OmpSs)
  - Adds runtime overhead
- Difficult Programming (SWARM, TBB)
  - Macro-based programming
  - Programmer responsible for dependencies, updates, etc.
- Shared memory (OmpSs, TBB, SWARM)
  - But need hardware support for cache-coherence







Wait!

## The DDM Model

- Data-Driven Multithreading Model
  - Data-Flow execution
  - Thread-based
  - Synchronization (Data-flow) graph
    - No locks
    - No barriers
    - Only dependencies
  - Control-Flow execution within a thread
    - Exploit inherent architecture & compiler optimizations





# **DDM Programming Systems**



| Characteristics of<br>Dataflow<br>Implementations | D <sup>2</sup> NOW                                                              | TFlux                                                                                                                                                                                           | DDM-VM <sub>c</sub>                                                                                                      | DDM-VM <sub>s</sub>                        | DDM-VM <sub>d</sub>                          | DDM <sub>FPGA</sub>                       | TFluxSCC                                           | TFluxTM                                                                                                                     |
|---------------------------------------------------|---------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|--------------------------------------------|----------------------------------------------|-------------------------------------------|----------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------|
| TSU Implementation<br>(Software / Hardware)       | Hardware                                                                        | Software Hardware                                                                                                                                                                               | Software                                                                                                                 | Software                                   | Software                                     | Hardware                                  | Software                                           | Software                                                                                                                    |
| Scheduling Policy<br>(Static / Dynamic)           | -                                                                               | Static                                                                                                                                                                                          | Static &<br>Dynamic                                                                                                      | Static &<br>Dynamic                        | Static                                       | Static &<br>Dynamic                       | Static                                             | Static                                                                                                                      |
| Memory Model<br>(Shared / Private)                | Distributed<br>Shared<br>Memory                                                 | Shared                                                                                                                                                                                          | Private                                                                                                                  | Shared                                     | Private                                      | Shared                                    | Shared Data<br>/ Distributed<br>Runtime            |                                                                                                                             |
| Needs cache-coherency                             | -                                                                               | Yes                                                                                                                                                                                             | No                                                                                                                       | Yes                                        | Yes                                          | Yes                                       | No                                                 | Yes                                                                                                                         |
| Number of cores/threads<br>tested                 | 32                                                                              | 6 27                                                                                                                                                                                            | 6 SPEs + 1 PPE                                                                                                           | 12                                         | 12 cores x 2<br>nodes                        | 8                                         | 48                                                 | 12                                                                                                                          |
| Max Speedup achieved                              | 26                                                                              | 5.9 25                                                                                                                                                                                          | Almost Linear                                                                                                            | 9.6                                        | 9.6 (SMT) /<br>16 (Distr)                    | 7.96                                      | 48                                                 | 6.2                                                                                                                         |
| How dependencies are<br>expressed                 | macros                                                                          | Directives                                                                                                                                                                                      | macros                                                                                                                   | macros                                     | macros                                       | macros                                    | Directives                                         | Directives                                                                                                                  |
| Contributions                                     | •<br>CacheFlow<br>• 1 <sup>st</sup> DDM<br>Simulated<br>Hardware<br>Distributed | <ul> <li>1<sup>st</sup> SMP Software<br/>implementation</li> <li>1<sup>st</sup> full system<br/>simulation</li> <li>Complete &amp; Portable</li> <li>Directive-based<br/>programming</li> </ul> | <ul> <li>1<sup>st</sup> heterog.</li> <li>Implement.</li> <li>Software</li> <li>CacheFlow</li> <li>Implement.</li> </ul> | • 1 <sup>st</sup><br>Software<br>CacheFlow | • 1 <sup>st</sup><br>software<br>distributed | • 1 <sup>st</sup><br>hardware<br>DDM FPGA | • 1 <sup>st</sup> many-<br>core<br>software<br>DDM | <ul> <li>1<sup>st</sup> integrat.</li> <li>of a DDM</li> <li>implementation</li> <li>with another</li> <li>model</li> </ul> |
| Date                                              | 2000                                                                            | 2008                                                                                                                                                                                            | 2011                                                                                                                     | 2010                                       | 2013                                         | 2014                                      | 2014 &<br>2015                                     | 2011 &<br>2015                                                                                                              |

# **Conclusions on DDM Systems**



- Centralized Runtime (all except TFluxSCC)
  - Single-point of communication
- Need hardware cache-coherence (all except TFluxSCC)
- Global SG (all except TFluxSCC)
  - Requires protection (e.g. locking)
    - Not scalable!
  - TFluxSCC uses one SG instance for every core
    - Memory expensive!

## Outline



- Motivation
- Contributions
- Related Work
  - Data-flow-based Programming Systems
  - DDM Programming Systems
- SWitches Programming System
- Preliminary Work
- Roadmap Timeline





Many-core Design for Data-Flow Execution: Using the SWitches Prototype Implementation

26

## Outline



- Motivation
- Contributions
- Related Work
  - Data-flow-based Programming Systems
  - DDM Programming Systems
- SWitches Programming System
- Preliminary Work
- Roadmap Timeline



#### From TFlux to SWitches - TFluxSoft





- Software implementation of the DDM model
  - For shared-memory multi-cores
  - Shared memory for updates exchange
- Global Synchronization Graph (SG)
  - Locking is required
  - Cache-coherence is required!
- Centralized scheduling unit (TSU)
  - Single-point of thread updates



#### Intel Single-chip Cloud Computing



- 48-core experimental processor
- Private **non-coherent** caches
- On-chip Message Passing Buffer (MPB)
- Shared off-chip main memory Uncacheable

- No locking!





**Management Console PC** 

Many-core Design for Data-Flow Execution: Using the SWitches Prototype Implementation

29

#### From TFlux to SWitches - TFluxSCC<sup>‡</sup>

<sup>‡</sup> Diavastos, Andreas, Giannos Stylianou, and Pedro Trancoso. "TFluxSCC: Exploiting Performance on Future Many-Core Systems through Data-Flow." *Parallel, Distributed and Network-Based Processing (PDP), 2015 23rd Euromicro International Conference on*. IEEE, 2015.

- TFluxSCC Memory model
  - MPB for TSU updates
  - Shared off-chip for application data





University of Cyprus

**Department of Computer** 

Updates

- Simultaneous access is not allowed on shared data in DDM
- Caching global data is enabled in TFluxSCC
- Flush caches to ensure write-back is complete

# **TFluxSCC<sup>‡</sup> Runtime Implementation**



<sup>‡</sup> Diavastos, Andreas, Giannos Stylianou, and Pedro Trancoso. "TFluxSCC: Exploiting Performance on Future Many-Core Systems through Data-Flow." *Parallel, Distributed and Network-Based Processing (PDP), 2015 23rd Euromicro International Conference on*. IEEE, 2015.



- Fully decentralized runtime system
- One TSU on every application thread
  - Multiple communication points
- Each core has it's own instance of SG
  - No locking





#### TFluxSoft on Intel Xeon Phi<sup>‡</sup>



- Intel Many Integrated Core Architecture
  - Knights Corner series
- 61 cores interconnected with on-die bidirectional ring
- 4-threaded cores (totaling 244 hardware threads)
- X86 ISA with 64-bit addressing
- 32 MB LLC shared, cache-coherent
  - 512KB L2 per core



<sup>‡</sup> We would like to thank The Cyprus Institute for letting use their Xeon Phi cluster for our experiments

#### Results: TFluxSCC & TFluxSoft





- TFluxSCC: Distributed TSU and local-SG
  - Multiple communication points for better scalability
  - Avoid monitoring & locking of shared data
- TFluxSoft: Centralized TSU and global-SG
  - Single point of synchronization
  - Contention on shared SG by all executing thread

24th March 2016



#### From TFlux to SWitches - TFluxTM<sup>‡</sup>



<sup>‡</sup>A. Diavastos, P. Trancoso, M. Lujan and I. Watson, "Integrating Transactions into the Data-Driven Multi-threading Model Using the TFlux Platform," *International Journal of Parallel Programming*, pp. 1-21, 2015

- Integrated Transactional Memory support
  - Flexibility of thread-based DDM
- Provide atomic operations
- Exploit runtime parallelisr
  - Using speculative parallel e
- Support more application



#### **Conclusions of TFlux experience**



#### • Positives:

- Software implementation
  - Portable across commodity hardware systems
- Thread-based granularity
  - Apply architectural & compiler optimization within threads
    - Pipelining, unrolling, vectorization, etc.
- Flexibility of DDM
  - Integrate multiple models in one system
- Shared-Memory model
  - Makes programming easier!
- All-the-time coherent execution
  - Data dependencies prohibit simultaneous access on shared data
  - No cache-coherence needed!!





#### **Conclusions of TFlux experience**



#### • Negatives:

- Large TSU structures (SG)
  - High memory footprint
  - Large initialization time
  - 65% of total execution time is used for TSU allocation & initialization
- Shared SG
  - Locking & Hardware coherence support
- Centralized TSU
  - Single-point of communication
  - 30% of time is spent in the TSU
    - This will increase with the number of cores
    - Most of the time it's not useful operations





## **SWitches:** Key Characteristics

- Connect Producers Consumers with switches
  - A unique switch for every producer
  - A producer turns ON it's switches when completed
  - A consumer starts when all it's producers switches are ON
  - Switches are boolean C/C++ data types

#### Decentralized SG & Runtime

- Single writer / Multiple readers (for every switch)
- No locking needed
- Low overhead notifications
- Use shared-memory
  - No cache-coherence support needed!









## **SWitches Evaluation:** 12-core multicore



- MMULT: Good scalability
  - TFlux looses performance at 12 cores (+1 TSU thread)
- RK4: TFlux looses performance
  - TFlux improves execution up to 8 cores
  - Too many TFlux threads
  - Too large SG shared among threads

24th March 2016

Many-core Design for Data-Flow Execution: Using the SWitches Prototype Implementation University of Cyprus Department of Computer

Science

### SWitches Evaluation: 60-core Xeon Phi



- MMULT: Good scalability
  - SWitches is slightly slower than OpenMP & TFlux (difference is negligible)
- RK4: TFlux is  $\approx 30 \times \text{slower}$ 
  - Centralized TSU becomes a communication bottleneck
  - Locking global SG with 240 threads is expensive!



## **SWitches Profiling: Lines of Code**





- TFlux produces on average 6× more code
  - Calls to the runtime system
  - More code to execute  $\rightarrow$  More execution time

## SWitches Profiling: Memory Usage





- The more threads we use the more memory each system needs
- TFlux on average requires more memory for SG
  - SG is created regardless of the # of cores used
  - SWitches creates SG info using the # of cores used (less cores  $\rightarrow$  less info)
  - Switches are less memory demanding (boolean)
  - OpenMP & SWitches use a chunk-based technique to increase/decrease granularity & reduce re-scheduling overheads

## SWitches Profiling: Scale beyond...





- TFlux: Locking of global SG & sharing among 512 threads!
- TFlux: Context switching interferes with centralized TSU (updates delayed)
- **OpenMP:** Also uses a centralized runtime
- SWitches: Decentralized runtime & local SG!!

24th March 2016

Many-core Design for Data-Flow Execution: Using the SWitches Prototype Implementation

42

00

## Outline



- Motivation
- Contributions
- Related Work
  - Data-flow-based Programming Systems
  - DDM Programming Systems
- SWitches Programming System
- Preliminary Work
- Roadmap Timeline





[1] Andreas Diavastos, Pedro Trancoso, Mikel Lujan and Ian Watson. "Integrating Transactions into the Data-Driven Multi-threading Model using the TFlux Platform", International Journal of Parallel Programming (IJPP 2015).

[2] Andreas Diavastos, Giannos Stylianou and Pedro Trancoso "TFluxSCC: Exploiting Performance on Future Many-core Systems through Data-Flow". In the Proceedings of the 23rd Euromicro International Conference, Distributed and Network-based Processing (PDP 2015).

[3] Panayiotis Petrides, Andreas Diavastos, Constantinos Christofi and Pedro Trancoso. "Scalability and Efficiency of Database Queries on Future Many-core Systems". In the Proceedings of the 21st Euromicro International Conference, Distributed and Network-based Processing (PDP 2013).

[4] Andreas Diavastos, Panayiotis Petrides, Gabriel Falcao, Pedro Trancoso. "LDPC Decoding on the Intel SCC", In the Proceedings of the 20th Euromicro International Conference, Distributed and Network-based Processing (PDP 2012).

## Thank You!



#### **CASPER Group**

#### Visit Us: <u>www.cs.ucy.ac.cy/carch/casper</u>

#### Computer Architecture, Systems and Performance Evaluation Research

#### List of Publications:

#### Journals:

- 1. Andreas Diavastos, Pedro Trancoso, Mikel Lujan and Ian Watson. "Integrating Transactions into the Data-Driven Multi-threading Model using the TFlux Platform", International Journal of Parallel Programming (IJPP 2015).
- 2. Andreas Diavastos, Giannos Stylianou, Giannis Koutsou. "Exploiting Very-Wide Vector Processing for Scientific Applications". (Article) In Computing in Science & Engineering (CiSE), Vol. 17, no. 6, Nov/Dec 2015, pp. 83.87.

#### **Conferences & Workshops:**

- **3.** Andreas Diavastos, Giannos Stylianou and Pedro Trancoso *"TFluxSCC: Exploiting Performance on Future Many-core Systems through Data-Flow"*. In the Proceedings of the 23rd Euromicro International Conference, Distributed and Network-based Processing (PDP 2015).
- 4. Andreas Diavastos, Giannos Stylianou, Pedro Trancoso. "TFluxSCC: A Case Study for Exploiting Performance in Future Many-core Systems". (Poster Paper) In the Proceedings of the 11th ACM Conference on Computing Frontiers, Cagliari, Italy, May 2014.
- 5. Panayiotis Petrides, **Andreas Diavastos**, Constantinos Christofi and Pedro Trancoso. *"Scalability and Efficiency of Database Queries on Future Manycore Systems"*. In the Proceedings of the 21st Euromicro International Conference, Distributed and Network-based Processing (PDP 2013).
- 6. Andreas Diavastos, Panayiotis Petrides, Gabriel Falcao, Pedro Trancoso. *"LDPC Decoding on the Intel SCC"*, In the Proceedings of the 20th Euromicro International Conference, Distributed and Network-based Processing (PDP 2012).
- 7. Andreas Diavastos, Pedro Trancoso, Mikel Lujan and Ian Watson. *"Integrating Transactions into the Data-Driven Multi-threading Model using the TFlux Platform"*, In the Proceedings of the Data-Flow Execution Models for Extreme Scale Computing Workshop (DFM 2011).
- 8. Andreas Diavastos, Giannos Stylianou and Giannis Koutsou "Exploiting Very-Wide Vectors on Intel Xeon Phi with Lattice-QCD kernels". In the Proceedings of the 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP 2016).

# **Backup Slides**

## **Serialization Sets**







## **Future Direction of Work**



- Implement & test more applications
- Parallel threads acceleration
  - GPUs
  - FPGAs
- Test SWitches on HPC systems
  - In collaboration with message-passing systems
  - Performance scalability on multi-node systems
  - e.g. SWitches for intra-node & MPI for inter-node
  - Test real HPC scientific applications

## Example execution of SWitches





## Example execution of SWitches





